Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art EM-Adapt approach on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches in a related task of video co-localization on the YouTube-Objects dataset.